fix(CubeMaster): expire node health after stale nodemeta heartbeat by zyl1121 · Pull Request #455 · TencentCloud/CubeSandbox

zyl1121 · 2026-06-04T02:14:04Z

Motivation

On the nodemeta path, CubeMaster can keep a node healthy after cubelet stops reporting heartbeats because the last reported Ready=True fact is reused as the current health result.

This makes stale nodes look healthy in control-plane reads and can leave them eligible for scheduling decisions longer than expected.

What this PR changes

derive current node health from the last reported readiness fact plus heartbeat freshness
make nodemeta reads return the current effective healthy result instead of a stale cached value
make healthy=false explicit in API output
expose unhealthy_reason for HeartbeatExpired and ReportedNotReady
make localcache read paths refresh health from cached facts without mutating shared cached nodes
apply the same health refresh logic to GetNodesByIp
make prefilter skip stale metadata / metric nodes instead of only logging timeout

Testing

go test -vet=off ./pkg/localcache ./pkg/nodemeta ./pkg/selector/prefilter

Manual validation:

stale heartbeat is returned as healthy=false
unhealthy_reason distinguishes HeartbeatExpired and ReportedNotReady
/internal/meta/nodes/{id} and /internal/node reflect the same health semantics

Documentation

No standalone documentation update is included in this PR. However, this change does affect the JSON shape of node responses: healthy=false is now emitted explicitly instead of being omitted.

Related issue

Fixes #454

cubesandboxbot · 2026-06-04T02:14:32Z

Claude finished @zyl1121's task in 1m 48s —— View job

✅ Code Review — PR #455

Progress

Gather context and understand the PR
Read all changed files
Analyze code for correctness, bugs, and edge cases
Check test coverage and completeness
Post review feedback

Overall Assessment

This is a clean, well-structured fix that correctly addresses the stale-node-health problem. The architecture is sound, the tests are thorough, and the prefilter hardening (Fatalf → Warnf + continue) is a significant reliability improvement.

Decision: APPROVE with minor observations — no blocking issues found.

Design Review

1. Clean separation of concerns (`nodehealth` package) ✅

The new pkg/base/nodehealth/health.go correctly encapsulates:

Status struct combining Healthy + UnhealthyReason
EvaluateFromFacts() — derives health from the four facts (reportedReady, heartbeatTime, now, timeout)
ReadyConditionTrue() — extracts Kubernetes Ready condition

This makes the evaluation logic independently testable and reusable across localcache, nodemeta, and any future consumer.

2. Immutability on read paths ✅

All read paths now clone before mutating health state:

localcache/export.go:cloneNodeWithCurrentHealth() — clones the node.Node, then applies health
nodemeta/service.go:cloneSnapshotWithCurrentHealth() — same pattern for NodeSnapshot

Both TestGetNodeRefreshesCurrentHealthFromCachedFacts (export_test.go:270) and TestGetHealthyNodesByInstanceTypeFiltersExpiredHeartbeat (export_test.go:309) explicitly verify that cached source nodes are not mutated during reads. This prevents a class of subtle race-condition bugs.

3. Prefilter hardening ✅

The change from Fatalf to Warnf + continue in prefilter.go:77-94 is the most impactful reliability improvement. Previously, a stale metric or metadata update would crash the entire scheduler via Fatalf. Now it gracefully skips the stale node and continues. This alone prevents cascading failures when node heartbeats lag.

4. Proper JSON serialization ✅

Healthy removed omitempty → now serializes "healthy":false explicitly (consumers can distinguish "unknown" from "unhealthy")
UnhealthyReason with omitempty → only present when there's a reason
ReportedReady with json:"-" → internal field, never serialized

This is validated by TestNodeSnapshotJSONIncludesHealthyFalseAndHidesReportedReady in service_resource_test.go:145.

Potential Issues & Observations

1. Duplicated timeout constant (minor)

metadataHealthTimeout() in export.go:166 and healthTimeout() in service.go:388 compute the same value: SyncMetaDataInterval + 10*time.Second. If one is updated independently, they'll drift. Consider extracting to a shared helper.

2. Lock scope in `downNodeCache` (very minor)

In node_cache.go:205-217, the lockMetaData mutex is released before delSortedNodes(n) is called. This was the case before the PR, so it's not a regression, but there's a tiny window between the unlock and the sorted-nodes deletion where a reader iterating sortedNodesByClusters could see a node that was just marked ReportedReady=false. In practice this is harmless since the next cycle will correct it.

3. Fast path removed in `GetHealthyNodesByInstanceType`

The PR removes the old fast path:

// old
if n == -1 {
    return clusterNodes
}

The old code returned raw references from the sorted cluster cache, which was efficient but allowed stale Healthy values to be used. The new code always clones and re-evaluates health, which is correct but has a performance cost on every call. This is the right trade-off — health correctness > micro-optimization.

4. `appendSortedNodes` reads `n.Healthy` without health recomputation

In node_cache.go:231, appendSortedNodes checks n.Healthy directly rather than recomputing it. This is fine for current callers because:

From addNodeCache/constructNode: Healthy was just set from the DB live-status
From updateNodeFromMetaData: Healthy was just set from n.Healthy on line 299
But it means if someone calls addNodeCache with a node whose Healthy is out of date, the health-based filtering in appendSortedNodes will use that stale value. The node will still be in the cache, so subsequent reads through GetNodes, GetNode, etc. will re-evaluate correctly — it's only the sortedNodesByClusters membership that could lag.

Test Coverage

Test	File	What it covers
`TestGetNodeRefreshesCurrentHealthFromCachedFacts`	`export_test.go:270`	Stale heartbeat → `Healthy=false`, source not mutated
`TestGetHealthyNodesByInstanceTypeFiltersExpiredHeartbeat`	`export_test.go:309`	Fresh vs stale node in cluster list
`TestGetNodesByIpRefreshesCurrentHealthFromCachedFacts`	`export_test.go:343`	Health recomputation in IP-based lookup
`TestNodeConcurrentCountersUpdateCachedNodeFromReadClone`	`export_test.go:371`	Incr/Decr routing to cached node
`TestCurrentHealthStatusExpiresStaleHeartbeat`	`service_resource_test.go:55`	Fresh + stale heartbeat scenarios
`TestUpdateNodeStatusStoresLastReportedReadyButExportsCurrentHealth`	`service_resource_test.go:89`	ReportedReady preserved, Healthy recomputed
`TestCloneSnapshotWithCurrentHealthRefreshesStaleNode`	`service_resource_test.go:121`	Cloned snapshot has correct health, source unchanged
`TestNodeSnapshotJSONIncludesHealthyFalseAndHidesReportedReady`	`service_resource_test.go:145`	JSON serialization correctness
`TestPreFilterExcludesExpiredHeartbeatNode`	`prefilter_test.go:34`	Prefilter skips stale metadata nodes
`TestPreFilterExcludesMetricTimeoutNode`	`prefilter_test.go:74`	Prefilter skips stale metric nodes

Missing (non-blocking): No test for GetNodes() or GetHealthyNodes() health refresh, though they use the same cloneNodeWithCurrentHealth path tested elsewhere.

Summary

The PR correctly fixes the stale heartbeat problem, is well-tested, and introduces no regressions. The design cleanly separates concerns, read paths are immutable, and the prefilter changes significantly improve scheduler robustness. No blocking issues found.

cubesandboxbot · 2026-06-04T02:55:55Z

Claude finished @zyl1121's task in 2m 45s —— View job

Code Review — PR #455

Progress

Analyze context and understand the request
Read all changed files
Analyze code for correctness, bugs, and edge cases
Check test coverage and completeness
Post review feedback

Overall Assessment

This PR correctly addresses the stale-node-health problem described in #454. The architecture is clean, the design is well-isolated, and tests are thorough. The Fatalf → Warnf + continue change in prefilter is a significant reliability improvement.

Decision: APPROVE — no blocking issues found. Observations below are minor and can be tracked separately.

Design Review

1. `nodehealth` package — clean abstraction ✅

pkg/base/nodehealth/health.go encapsulates the health evaluation logic:

Status struct with Healthy + UnhealthyReason
EvaluateFromFacts(reportedReady, heartbeatTime, now, timeout) — pure function from four inputs to one output
Evaluate(conditions, heartbeatTime, now, timeout) — convenience wrapper that extracts ReadyConditionTrue

This is independently testable and reusable across localcache, nodemeta, and any future consumer. The two exported reason constants (ReasonReportedNotReady, ReasonHeartbeatExpired) make the API self-documenting.

2. Read-path immutability ✅

Both localcache and nodemeta now clone nodes before mutating health on read:

localcache/export.go:170-179 — cloneNodeWithCurrentHealth() clones via n.Clone(), then applies health evaluation from cached facts
nodemeta/service.go:488-492 — cloneSnapshotWithCurrentHealth() clones via cloneSnapshot(), then applies applyCurrentHealth()

All read paths (GetNodes, GetNode, GetNodesByIp, GetHealthyNodesByInstanceType, GetNode in nodemeta, ListNodes) use these helpers. Tests at export_test.go:296-306 and service_resource_test.go:121-143 explicitly verify source nodes are not mutated during reads.

3. Prefilter hardening ✅

In prefilter.go:78-94, the change from Fatalf to Warnf + continue for metadata and metric timeouts means stale nodes no longer crash the scheduler. This was the most impactful reliability win.

The Fatalf at line 73-77 for CpuLoadUsage > CpuTotal is correctly left unchanged — that condition represents a data integrity issue that warrants a crash.

4. JSON serialization ✅

node.go changes:

Healthy removed omitempty → "healthy":false is now explicit in JSON
UnhealthyReason keeps omitempty → only serialized when non-empty
ReportedReady uses json:"-" → never serialized

The NodeSnapshot in nodemeta/service.go:98-118 mirrors these correctly.

Observations

1. `metadataHealthTimeout()` / `healthTimeout()` duplication

export.go:166-168 and service.go:388-390 both compute the same value: SyncMetaDataInterval + 10s. Both delegate to nodehealth.MetadataTimeout(), so the formula itself is shared — but they are separate wrapper functions. If the timeout derivation logic changes beyond the function call, only one wrapper might be updated. Minor, but worth tracking.

2. `appendSortedNodes` uses cached `Healthy` directly

At node_cache.go:230-233, appendSortedNodes gates on n.Healthy rather than recomputing health from facts:

func (l *local) appendSortedNodes(n *node.Node) {
    if n == nil || !n.Healthy {
        return
    }

This means:

For nodes coming from toSchedulerNode (nodemeta path): Healthy was just computed from currentHealthStatus(), so it's correct.
For nodes from constructNode (legacy DB path): Healthy equals constants.HeartbeatHealth == elem.LiveStatus, which is the persisted state rather than a freshness-based computation. A stale node loaded from DB with LiveStatus == HeartbeatHealth will be added to sortedNodesByClusters.

This is not a correctness bug because GetHealthyNodesByInstanceType recomputes health on every read via cloneNodeWithCurrentHealth() — so stale nodes in sortedNodesByClusters are filtered out during reads. The node appears in the sorted list but is excluded by the health filter. The impact is limited to a slightly larger sortedNodesByClusters list than necessary.

3. Redundant health checks in prefilter

prefilter.go:48-53 already skips !n.Healthy nodes. Then lines 90-95 check MetaDataUpdateAt timeout separately. Since Healthy is recomputed from MetaDataUpdateAt + timeout, the !n.Healthy check should catch stale-heartbeat nodes. The timeout check at line 90 acts as defense-in-depth in case a stale node somehow has Healthy=true (e.g., race with an update). The redundancy is acceptable.

4. `downNodeCache` mutates cached node in-place

node_cache.go:205-212 mutates the cached node directly with lockMetaData held, then releases the lock before calling delSortedNodes(n):

l.lockMetaData.Lock()
old := v.(*node.Node)
old.ReportedReady = false
old.Healthy = false
...
l.lockMetaData.Unlock()
l.delSortedNodes(n)

The mutation is correct (protected by lockMetaData), and delSortedNodes has its own lockSortedNodes lock. The brief window between Unlock and delSortedNodes is harmless — sortedNodesByClusters may include this node for an instant longer, but the next DB sync cleans it up. Not a regression.

5. `incrementNodeCacheCounts` in `node_cache.go` — consistent update

At node_cache.go:372, IncrNodeConcurrent calls getMutableNode(n), which fetches the cached node by n.ID(). Since read paths return a clone, IncrNodeConcurrent correctly routes to the cached reference. This means the clone/read pattern doesn't break concurrent counters.

Test Coverage

Test	File	What it covers
`TestGetNodeRefreshesCurrentHealthFromCachedFacts`	`export_test.go:270`	Stale heartbeat → `Healthy=false`, source not mutated
`TestGetHealthyNodesByInstanceTypeFiltersExpiredHeartbeat`	`export_test.go:309`	Fresh vs stale in cluster list, source not mutated
`TestGetNodesByIpRefreshesCurrentHealthFromCachedFacts`	`export_test.go:343`	Health recomputation on IP-based lookup
`TestNodeConcurrentCountersUpdateCachedNodeFromReadClone`	`export_test.go:371`	`IncrNodeConcurrent`/`DecrNodeConcurrent` route to cached node
`TestCurrentHealthStatusExpiresStaleHeartbeat`	`service_resource_test.go:55`	Fresh + stale heartbeat scenarios
`TestUpdateNodeStatusStoresLastReportedReadyButExportsCurrentHealth`	`service_resource_test.go:89`	`ReportedReady` preserved, `Healthy` recomputed
`TestCloneSnapshotWithCurrentHealthRefreshesStaleNode`	`service_resource_test.go:121`	Cloned snapshot health correct, source unchanged
`TestNodeSnapshotJSONIncludesHealthyFalseAndHidesReportedReady`	`service_resource_test.go:145`	JSON serialization
`TestPreFilterExcludesExpiredHeartbeatNode`	`prefilter_test.go:34`	Prefilter skips stale metadata node
`TestPreFilterExcludesMetricTimeoutNode`	`prefilter_test.go:74`	Prefilter skips stale metric node
`TestToSchedulerNodeDoesNotForgeMetricUpdate`	`service_resource_test.go:33`	Metric fields left zero in nodemeta→scheduler conversion
`TestSyncNodeTemplatesReconcilesHeartbeatState`	`export_test.go:160`	Template reconciliation during heartbeat

Coverage is thorough. Missing (non-blocking): no direct test for GetNodes() or GetHealthyNodes() health refresh, though both use the same cloneNodeWithCurrentHealth path.

Summary

The PR correctly fixes the stale-heartbeat problem with a clean architecture, immutable read paths, and comprehensive tests. The preloader reliability improvements (Fatalf → continue) are significant. No blocking issues.

fslongjin · 2026-06-04T07:49:02Z

+// Copyright (c) 2026 Tencent Inc.
+// SPDX-License-Identifier: Apache-2.0
+//


should not add this line

fslongjin · 2026-06-04T07:49:09Z

+// Copyright (c) 2026 Tencent Inc.
+// SPDX-License-Identifier: Apache-2.0
+//
+


should not add this line

should not add this line

@fslongjin Removed the extra header comments in a follow-up commit. Thanks for catching that.

cubesandboxbot · 2026-06-04T08:08:39Z

Claude finished @zyl1121's task in 1m 1s —— View job

Re-review after commit `10fcd30` — Extra header comments removed

Progress

Verify that extra header comments have been removed
Confirm files are correct after the fix

✅ Fix verified

The follow-up commit 10fcd30 ("fix(CubeMaster): drop extra header comments") correctly addresses both comments from @fslongjin:

CubeMaster/pkg/base/nodehealth/health.go — now starts directly with package nodehealth (no blank copyright/SPDX header above the package declaration).
CubeMaster/pkg/selector/prefilter/prefilter_test.go — same fix applied, starts directly with package prefilter.

Both files now match the project's convention (as seen in prefilter.go:1-6 and export.go:1-6), where the copyright header comment sits above a package-level doc comment, not as a floating block on its own.

No other changes were detected in this push. The rest of the PR's code, tests, and design remain as reviewed in the previous round.

Decision: APPROVE — no issues remain.

fslongjin

Thanks for this PR — the facts-vs-evaluation separation (ReportedReady vs Healthy) and the centralized nodehealth package are clean architectural improvements. I left a few inline comments on areas that could be tightened up.

Summary of inline comments

prefilter.go — Redundant MetaDataUpdateAt timeout check; GetHealthyNodesByInstanceType already expires stale heartbeats via cloneNodeWithCurrentHealth.
node.go:Clone() — Struct copy reads LocalCreateNum non-atomically (data race per Go memory model).
service.go — Duplicate health evaluation in UpdateNodeStatus (line 239) and toSchedulerNode (line 412); two time.Now() calls on the same snapshot.
node.go:Healthy JSON tag — omitempty removed; API contract change worth calling out explicitly.
service.go:RegisterNode — Returns cloneSnapshot without applying current health, inconsistent with GetNode/ListNodes/UpdateNodeStatus.

fslongjin · 2026-06-04T08:23:42Z

 				"CalleeCluster": n.ClusterLabel,
-			}).Fatalf("%s MetaDataUpdate timeout,lastupdate:%v", n.IP, n.MetaDataUpdateAt)
+			}).Warnf("%s MetaDataUpdate timeout,lastupdate:%v", n.IP, n.MetaDataUpdateAt)
+			continue


🔁 Redundant heartbeat-timeout check

GetHealthyNodesByInstanceType (line 40) already clones every node with cloneNodeWithCurrentHealth, which calls nodehealth.EvaluateFromFacts using the same MetadataTimeout. A node whose heartbeat has expired will already have Healthy=false and will have been excluded from the candidate set before this loop runs.

This second MetaDataUpdateAt guard therefore duplicates what the health evaluation already does. Over time, keeping two separate timeout calculations risks the two checks drifting (e.g. someone changes how cloneNodeWithCurrentHealth computes the timeout but forgets to update this prefilter copy, or vice versa).

Suggestion: Rely solely on the n.Healthy check above (line 49). That gives a single source of truth for heartbeat-liveness decisions.

fslongjin · 2026-06-04T08:23:42Z

+		cloned.VirtualNodeQuotaArray = append([]int64(nil), n.VirtualNodeQuotaArray...)
+	}
+	return &cloned
+}


⚠️ Non-atomic read of LocalCreateNum

The struct copy cloned := *n reads every field including LocalCreateNum (an int64), which is accessed concurrently via atomic.AddInt64 (in LocalCreateNumIncrBy) and atomic.LoadInt64 (in LocalCreateConcurrentLimit).

Per the Go memory model this is a data race and would be flagged by go test -race. In practice on amd64 it is safe (aligned 8-byte reads are atomic at the hardware level), but on 32-bit platforms (ARM) this could tear.

If this project only targets amd64, a brief comment acknowledging the deliberate non-atomic copy would help future readers (and suppress race-detector noise). Otherwise, consider using atomic.LoadInt64(&n.LocalCreateNum) for the copy.

fslongjin · 2026-06-04T08:23:42Z

@@ -419,7 +438,9 @@ func toSchedulerNode(snap *NodeSnapshot) *node.Node {
 		OssClusterLabel:     snap.ClusterLabel,


🔄 Duplicate health evaluation with UpdateNodeStatus line 239

UpdateNodeStatus already calls applyCurrentHealth(snap, time.Now()) (line 239, position 46 in this diff), which runs currentHealthStatus internally. syncLocalcache on line 241 then calls toSchedulerNode, which calls currentHealthStatus(snap, time.Now()) again here.

The two time.Now() calls are microseconds apart, so the result is effectively identical — but it is redundant work and creates a maintenance risk: if someone later extends the health-decision logic in one call site but not the other, the two representations of the same snapshot's health will diverge.

Suggestion: Have applyCurrentHealth return the nodehealth.Status it computed (e.g. func applyCurrentHealth(...) nodehealth.Status), and let toSchedulerNode consume it instead of re-deriving from scratch. This keeps a single evaluation per heartbeat update.

fslongjin · 2026-06-04T08:23:43Z

+	UnhealthyReason string `json:"UnhealthyReason,omitempty"`

 	Score float64 `json:"Score,omitempty"`



📢 API contract change: omitempty removed

The Healthy field changed from json:"Healthy,omitempty" to json:"Healthy", meaning "healthy":false is now always present in JSON output instead of being omitted. This is deliberate (the nodemeta test confirms it), but it is a wire-format change that existing API consumers may depend on.

Consider calling this out in the PR description and the next changelog entry so integrators know to expect the new field shape.

fslongjin · 2026-06-04T08:23:43Z

+	applyCurrentHealth(out, now)
+	return out
+}
+


❓ Inconsistency: RegisterNode doesn't use this helper

cloneSnapshotWithCurrentHealth is now used by GetNode (line 300) and ListNodes (line 310), and UpdateNodeStatus calls applyCurrentHealth inline (line 239). But RegisterNode (line 199, unchanged) still returns a plain cloneSnapshot(snap) without calling applyCurrentHealth.

The caller gets Healthy=false (zero value) with an empty UnhealthyReason — rather than UnhealthyReason="HeartbeatExpired" which would be the consistent answer for a node that has never sent a heartbeat yet.

Should RegisterNode also use cloneSnapshotWithCurrentHealth for consistency?

cubesandboxbot · 2026-06-04T09:32:31Z

Claude finished @zyl1121's task in 2m 20s —— View job

Claude is analyzing the latest PR state...

Progress

Analyze all 5 inline review comments from @fslongjin
Verify current code state against each concern
Check if remaining fixes are needed
Post analysis

Analysis of @fslongjin's Review Against Current PR State

Commit faba88d ("tighten stale-node health follow-ups") was pushed after @fslongjin's review and addresses all technical concerns. Here's the per-comment breakdown:

1. ✅ prefilter.go — Redundant `MetaDataUpdateAt` timeout check

Status: Fixed in faba88d

The redundant metaDataUpdateAtTimeout variable and the time.Since(n.MetaDataUpdateAt) > metaDataUpdateAtTimeout guard were both removed. The prefilter now relies solely on n.Healthy (line 47), which is already recomputed from heartbeat freshness by GetHealthyNodesByInstanceType → cloneNodeWithCurrentHealth → nodehealth.EvaluateFromFacts.

2. ✅ node.go:Clone() — Non-atomic read of `LocalCreateNum`

Status: Fixed in faba88d

Clone() now performs an atomic.LoadInt64(&n.LocalCreateNum) before the struct copy and overwrites the copied field, with an explanatory comment:

// Clone provides a best-effort read-side snapshot. Mutable counters such
// as LocalCreateNum are refreshed via atomic loads after the structural
// copy so cloned read models stay aligned with the write path.
localCreateNum := atomic.LoadInt64(&n.LocalCreateNum)
cloned := *n
cloned.LocalCreateNum = localCreateNum

3. ✅ service.go — Duplicate health evaluation

Status: Fixed in faba88d

toSchedulerNode no longer calls currentHealthStatus(snap, time.Now()). Instead it reads snap.Healthy and snap.UnhealthyReason directly — fields that were already set by applyCurrentHealth earlier in the call chain (UpdateNodeStatus line 240). There is now exactly one health evaluation per heartbeat update.

4. 📢 node.go:Healthy JSON tag — `omitempty` removed

Status: Noted, no code change needed

The omitempty removal is a deliberate wire-format change (confirmed by TestNodeSnapshotJSONIncludesHealthyFalseAndHidesReportedReady). As @fslongjin suggested, this should be called out in the PR description and next changelog entry — that's a documentation/communication task, not a code fix.

5. ✅ RegisterNode — Inconsistency with `cloneSnapshotWithCurrentHealth`

Status: Fixed in faba88d

Commit faba88d added applyCurrentHealth(snap, time.Now()) inside RegisterNode (line 197) before returning cloneSnapshot(snap) (line 200). This means the returned snapshot now has correct health (e.g., Healthy=false, UnhealthyReason="HeartbeatExpired" for a fresh node) rather than zero-value fields.

The function still returns cloneSnapshot(snap) rather than cloneSnapshotWithCurrentHealth(snap, time.Now()), but since applyCurrentHealth was just called, both produce identical results. This is consistent in practice.

Summary

All 5 inline review items from @fslongjin have been addressed:

#	Concern	Status
1	Redundant MetaDataUpdateAt check in prefilter	✅ Fixed
2	Non-atomic Clone() LocalCreateNum read	✅ Fixed
3	Duplicate health evaluation in toSchedulerNode	✅ Fixed
4	JSON omitempty API contract change	📢 Noted for docs
5	RegisterNode missing health application	✅ Fixed

No further code changes are needed. The PR is in good shape.

zyl1121 · 2026-06-04T09:38:29Z

Thanks for this PR — the facts-vs-evaluation separation (ReportedReady vs Healthy) and the centralized nodehealth package are clean architectural improvements. I left a few inline comments on areas that could be tightened up.

Summary of inline comments

prefilter.go — Redundant MetaDataUpdateAt timeout check; GetHealthyNodesByInstanceType already expires stale heartbeats via cloneNodeWithCurrentHealth.

node.go:Clone() — Struct copy reads LocalCreateNum non-atomically (data race per Go memory model).

service.go — Duplicate health evaluation in UpdateNodeStatus (line 239) and toSchedulerNode (line 412); two time.Now() calls on the same snapshot.

node.go:Healthy JSON tag — omitempty removed; API contract change worth calling out explicitly.

service.go:RegisterNode — Returns cloneSnapshot without applying current health, inconsistent with GetNode/ListNodes/UpdateNodeStatus.

Thanks for the detailed review. I addressed the issues you found in follow-up commit, and I also updated the PR description to explicitly mention that healthy=false is now emitted in JSON output.

fslongjin · 2026-06-05T08:10:29Z

Hi @zyl1121, thanks for this PR!

The DCO check is currently failing because all 4 commits are missing the Signed-off-by trailer. The CubeSandbox project requires every commit to carry a DCO sign-off.

Could you please:

Amend each commit to add your Signed-off-by line (e.g., Signed-off-by: zhengyilei <zhengyilei@myhexin.com>)
Force push the updated branch

You can do this with an interactive rebase:

git rebase -i HEAD~4
# mark each commit as 'reword', and append:
# Signed-off-by: zhengyilei <zhengyilei@myhexin.com>

Or more conveniently use git rebase --signoff:

git rebase --signoff HEAD~4
git push --force-with-lease

Once all commits have the sign-off, the DCO check should pass. Thanks!

cubesandboxbot · 2026-06-05T08:48:37Z

Claude finished @zyl1121's task in 1m 22s —— View job

PR #455 Re-trigger Analysis

Progress

Gather context — check git state and new commits
Analyze the request and determine what action to take
Update GitHub comment with results

Current State

This is a re-trigger on a PR synchronize event. After examining the branch state:

✅ No new commits since last review

The same 4 commits reviewed previously are still present (latest: 7ebdb38, authored 2026-06-04 17:31:37 +0800). There are no new changes to review.

The PR is in good shape:

All previous review feedback from @fslongjin was addressed in commit faba88d / 7ebdb38
@fslongjin approved the PR on 2026-06-05
Test coverage is thorough (unit tests for localcache, nodemeta, prefilter)
The architecture (centralized nodehealth package, immutable read paths, prefilter hardening) is solid

⚠️ Remaining item: DCO sign-off

@fslongjin noted on 2026-06-05 that the DCO check is failing because all 4 commits are missing the Signed-off-by trailer. @zyl1121 (the author), you'll need to add the sign-off trailers. The easiest approach:

git rebase --signoff HEAD~4
git push --force-with-lease

This adds Signed-off-by: Your Name <your.email@example.com> to each commit. I cannot perform this step on your behalf, as the sign-off must come from the actual author.

Summary

Item	Status
Code review feedback addressed	✅ All 5 items fixed in follow-up commits
PR approved by @fslongjin	✅
DCO check passing	❌ Needs author action
New commits to review	None — same 4 commits as before

Once the DCO sign-off is applied, this PR should be ready to merge.

zyl1121 · 2026-06-05T08:54:19Z

Hi @zyl1121, thanks for this PR!

The DCO check is currently failing because all 4 commits are missing the Signed-off-by trailer. The CubeSandbox project requires every commit to carry a DCO sign-off.

Could you please:

Amend each commit to add your Signed-off-by line (e.g., Signed-off-by: zhengyilei <zhengyilei@myhexin.com>)

Force push the updated branch

You can do this with an interactive rebase:
git rebase -i HEAD~4
# mark each commit as 'reword', and append:
# Signed-off-by: zhengyilei <zhengyilei@myhexin.com>
Or more conveniently use git rebase --signoff:
git rebase --signoff HEAD~4
git push --force-with-lease
Once all commits have the sign-off, the DCO check should pass. Thanks!

Fixed and force-pushed the branch with signed-off commits. Thanks!

Derive current node health from the last reported readiness fact and heartbeat freshness so stale heartbeats no longer remain healthy in nodemeta reads, localcache-backed reads, or scheduler prefilter. Signed-off-by: zhengyilei <zheng_yilei@qq.com>

Reuse a shared helper for nodemeta heartbeat expiry so localcache, nodemeta, and prefilter keep the same timeout rule and avoid drift. Signed-off-by: zhengyilei <zheng_yilei@qq.com>

Remove the added file header comments from the new Go files to match maintainer review feedback and keep this PR aligned with current repo expectations. Signed-off-by: zhengyilei <zheng_yilei@qq.com>

Remove the redundant prefilter heartbeat timeout check, carry the current health state through nodemeta registration and scheduler projection, and harden read-side cloning around mutable node state. Signed-off-by: zhengyilei <zheng_yilei@qq.com>

cubesandboxbot · 2026-06-05T09:03:33Z

Claude finished @zyl1121's task in 34s —— View job

Progress

Analyze PR synchronize event context
Check for new commits on the branch
Verify DCO sign-off status
Update comment with findings

Analysis — PR #455 Synchronize Event

Current State

This is a re-trigger on a PR synchronize event. After examining the branch:

✅ No new commits since last review

The same 4 commits reviewed previously are still present (latest: 0aac550, all with DCO sign-off added). There are no new code changes to review.

✅ DCO sign-off confirmed

All 4 commits now carry the Signed-off-by: zhengyilei <zheng_yilei@qq.com> trailer:

3175c3e cubemaster: expire node health after stale nodemeta heartbeat
d1b27d4 cubemaster: centralize nodemeta heartbeat timeout
6c09b10 fix(CubeMaster): drop extra header comments
0aac550 fix(CubeMaster): tighten stale-node health follow-ups

The DCO check should now pass.

Summary

Item	Status
Code review feedback addressed	✅ All 5 items fixed
PR approved by @fslongjin	✅
DCO sign-off applied	✅
New commits to review	None — same 4 commits

The PR is fully ready — all review feedback has been addressed, the DCO issue has been resolved, and the PR has been approved. No further action needed from my side.

zyl1121 requested review from chenhengqi, fslongjin, ls-ggg, tinklone and up2wing as code owners June 4, 2026 02:14

zyl1121 force-pushed the fix/nodemeta-heartbeat-health branch from c8b8153 to 5eb0cda Compare June 4, 2026 02:55

zyl1121 changed the title ~~cubemaster: expire stale nodemeta node health~~ fix(CubeMaster): expire node health after stale nodemeta heartbeat Jun 4, 2026

fslongjin reviewed Jun 4, 2026

View reviewed changes

zyl1121 force-pushed the fix/nodemeta-heartbeat-health branch from 5eb0cda to 10fcd30 Compare June 4, 2026 08:08

fslongjin reviewed Jun 4, 2026

View reviewed changes

fslongjin approved these changes Jun 5, 2026

View reviewed changes

zyl1121 force-pushed the fix/nodemeta-heartbeat-health branch from faba88d to 7ebdb38 Compare June 5, 2026 08:48

zhengyilei added 4 commits June 5, 2026 17:02

cubemaster: centralize nodemeta heartbeat timeout

d1b27d4

Reuse a shared helper for nodemeta heartbeat expiry so localcache, nodemeta, and prefilter keep the same timeout rule and avoid drift. Signed-off-by: zhengyilei <zheng_yilei@qq.com>

fix(CubeMaster): drop extra header comments

6c09b10

Remove the added file header comments from the new Go files to match maintainer review feedback and keep this PR aligned with current repo expectations. Signed-off-by: zhengyilei <zheng_yilei@qq.com>

zyl1121 force-pushed the fix/nodemeta-heartbeat-health branch from 7ebdb38 to 0aac550 Compare June 5, 2026 09:03

		@@ -419,7 +438,9 @@ func toSchedulerNode(snap NodeSnapshot) node.Node {
		OssClusterLabel: snap.ClusterLabel,

		UnhealthyReason string `json:"UnhealthyReason,omitempty"`

		Score float64 `json:"Score,omitempty"`

Conversation

zyl1121 commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

What this PR changes

Testing

Documentation

Related issue

Uh oh!

cubesandboxbot Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Code Review — PR #455

Progress

Overall Assessment

Design Review

1. Clean separation of concerns (nodehealth package) ✅

2. Immutability on read paths ✅

3. Prefilter hardening ✅

4. Proper JSON serialization ✅

Potential Issues & Observations

1. Duplicated timeout constant (minor)

2. Lock scope in downNodeCache (very minor)

3. Fast path removed in GetHealthyNodesByInstanceType

4. appendSortedNodes reads n.Healthy without health recomputation

Test Coverage

Summary

Uh oh!

cubesandboxbot Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — PR #455

Progress

Overall Assessment

Design Review

1. nodehealth package — clean abstraction ✅

2. Read-path immutability ✅

3. Prefilter hardening ✅

4. JSON serialization ✅

Observations

1. metadataHealthTimeout() / healthTimeout() duplication

2. appendSortedNodes uses cached Healthy directly

3. Redundant health checks in prefilter

4. downNodeCache mutates cached node in-place

5. incrementNodeCacheCounts in node_cache.go — consistent update

Test Coverage

Summary

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review after commit 10fcd30 — Extra header comments removed

Progress

✅ Fix verified

Uh oh!

fslongjin left a comment

Choose a reason for hiding this comment

Summary of inline comments

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Analysis of @fslongjin's Review Against Current PR State

1. ✅ prefilter.go — Redundant MetaDataUpdateAt timeout check

2. ✅ node.go:Clone() — Non-atomic read of LocalCreateNum

3. ✅ service.go — Duplicate health evaluation

4. 📢 node.go:Healthy JSON tag — omitempty removed

zyl1121 commented Jun 4, 2026 •

edited

Loading

cubesandboxbot Bot commented Jun 4, 2026 •

edited

Loading

1. Clean separation of concerns (`nodehealth` package) ✅

2. Lock scope in `downNodeCache` (very minor)

3. Fast path removed in `GetHealthyNodesByInstanceType`

4. `appendSortedNodes` reads `n.Healthy` without health recomputation

cubesandboxbot Bot commented Jun 4, 2026 •

edited

Loading

1. `nodehealth` package — clean abstraction ✅

1. `metadataHealthTimeout()` / `healthTimeout()` duplication

2. `appendSortedNodes` uses cached `Healthy` directly

4. `downNodeCache` mutates cached node in-place

5. `incrementNodeCacheCounts` in `node_cache.go` — consistent update

cubesandboxbot Bot commented Jun 4, 2026 •

edited

Loading

Re-review after commit `10fcd30` — Extra header comments removed

cubesandboxbot Bot commented Jun 4, 2026 •

edited

Loading

1. ✅ prefilter.go — Redundant `MetaDataUpdateAt` timeout check

2. ✅ node.go:Clone() — Non-atomic read of `LocalCreateNum`

4. 📢 node.go:Healthy JSON tag — `omitempty` removed

5. ✅ RegisterNode — Inconsistency with `cloneSnapshotWithCurrentHealth`

zyl1121 commented Jun 4, 2026 •

edited

Loading

cubesandboxbot Bot commented Jun 5, 2026 •

edited

Loading

cubesandboxbot Bot commented Jun 5, 2026 •

edited

Loading